Reinforcement Learning Policy Approximation by Behavior Trees

Authors

  • Y. S. Janssen
  • Kirk Scheper
Abstract

Traditionally, a Reinforcement Learning (RL) policy is stored in a lookup table. From such a table it is difficult to observe the behavioral logic or to adjust this logic manually after learning. This paper shows how the behavioral logic of an RL controller can be presented in an insightful manner and adjusted using the Behavior Tree (BT) framework. It presents a method that approximates an RL policy with a Genetic Algorithm (GA) evolving BTs, for a guidance task carried out by a UAV navigating in an unknown environment. The method shows how Discrete Time Markov Chains (DTMCs) can be used to increase optimization speed. The execution of the RL controller interacting with the environment is mapped to a DTMC, which yields a representation of the one-step transition matrix. The GA evaluates BTs by transitioning through this DTMC instead of the simulation environment; without the need to simulate a computationally expensive environment, the optimization runs faster. The method is demonstrated on a UAV simulation using a Q-learning algorithm to learn a guidance task consisting of an avoidance behavior and a goal-seeking behavior. The evolved BT, containing 6 nodes, correctly reproduces 66% of the RL policy's actions over all states in the Q-table. When this BT is run in the simulation environment, it achieves a success rate of 93%. After manual adaptation by the experimenter, the agreement over all states in the Q-table increases to 86% and the success rate in simulation to 96%. For verification of the avoidance behavior, only 3 nodes of the evolved BT need to be verified, compared to 1448 state-action pairs of the Q-table.
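
To make the evaluation idea concrete, here is a minimal Python sketch (with hypothetical names and array shapes, not the paper's implementation) of the two scoring steps described in the abstract: measuring a candidate BT policy's agreement with the greedy Q-table actions, and estimating its success rate by stepping through an estimated DTMC one-step transition matrix rather than the full simulator.

```python
import numpy as np

# Minimal sketch, assuming a tabular setting: `bt_policy` is any callable mapping a
# state index to an action index (e.g. the action chosen by an evolved Behavior Tree),
# `q_table` has shape (n_states, n_actions), and P[s, a, s'] is a one-step transition
# matrix estimated from logged RL episodes (the DTMC). All names are illustrative.

def action_agreement(bt_policy, q_table):
    """Fraction of states where the BT picks the greedy Q-table action."""
    greedy = np.argmax(q_table, axis=1)
    chosen = np.array([bt_policy(s) for s in range(q_table.shape[0])])
    return float(np.mean(chosen == greedy))

def dtmc_success_rate(bt_policy, P, start, goal, horizon=200, episodes=1000, seed=0):
    """Monte Carlo estimate of reaching `goal` when transitions follow the DTMC
    instead of the (computationally expensive) simulation environment."""
    rng = np.random.default_rng(seed)
    n_states, successes = P.shape[0], 0
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            if s == goal:
                successes += 1
                break
            s = rng.choice(n_states, p=P[s, bt_policy(s)])
    return successes / episodes
```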

Similar Articles

Tree-Based Batch Mode Reinforcement Learning

Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x_t, u_t, r_t, x_{t+1}), where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the succe...
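
As a point of comparison, below is a small sketch of a single fitted Q-iteration sweep over such a batch of four-tuples, using scikit-learn's ExtraTreesRegressor as the tree-based regressor. Variable names and the finite action set `actions` are assumptions for illustration; this is the general technique, not the cited paper's exact implementation.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Sketch of one fitted Q-iteration sweep over a batch of four-tuples
# (x_t, u_t, r_t, x_{t+1}); `actions` is an assumed finite action set and all
# names are illustrative, not taken from the cited paper.

def fitted_q_sweep(X, U, R, X_next, actions, q_model=None, gamma=0.95):
    inputs = np.column_stack([X, U])          # regress Q on (state, action) pairs
    if q_model is None:
        targets = R                           # first sweep: Q_1(x, u) = r
    else:
        # Bootstrapped target: r_t + gamma * max_u' Q_k(x_{t+1}, u')
        q_next = np.column_stack([
            q_model.predict(np.column_stack([X_next, np.full(len(X_next), a)]))
            for a in actions
        ])
        targets = R + gamma * q_next.max(axis=1)
    return ExtraTreesRegressor(n_estimators=50).fit(inputs, targets)
```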

Convergence of reinforcement learning using a decision tree learner

In this paper, we propose conditions under which Q iteration using decision trees for function approximation is guaranteed to converge to the optimal policy in the limit, using only a storage space linear in the size of the decision tree. We analyze different factors that influence the efficiency of the proposed algorithm, and in particular study the efficiency of different concept languages. W...

A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradien...
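
For illustration, here is a sketch of one per-step update in the gradient-TD family with linear function approximation and a secondary weight vector, written in the commonly cited TDC/GTD2 form with an importance-sampling ratio `rho` for off-policy data. It conveys the O(n)-per-step flavor described above but is not necessarily the exact update of the cited paper; all names are illustrative.

```python
import numpy as np

# Sketch of a gradient-TD style per-step update (TDC/GTD2 form) with linear
# function approximation; an assumption-laden illustration, not the cited
# paper's exact algorithm.
def gradient_td_step(theta, w, phi, phi_next, reward, rho,
                     gamma=0.99, alpha=0.01, beta=0.05):
    delta = reward + gamma * theta @ phi_next - theta @ phi    # TD error
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - w @ phi) * phi               # tracks E[delta | phi]
    return theta, w
```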

Near Optimal On-Policy Control

We introduce two online gradient-based reinforcement learning algorithms with function approximation – one model based, and the other model free – for which we provide a regret analysis. Our regret analysis has the benefit that, unlike many other gradient based algorithm analyses for reinforcement learning with function approximation, it makes no probabilistic assumptions meaning that we need n...

Convergent Tree-Backup and Retrace with Function Approximation

Off-policy learning is key to scaling up reinforcement learning, as it allows learning about a target policy from the experience generated by a different behavior policy. Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms. In this paper, we show that the Tree Ba...
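
As a rough illustration of the multi-step off-policy idea, the following sketch computes a Retrace(λ)-style correction to Q(x_0, a_0) over one logged trajectory in the tabular case, using truncated importance ratios c_t = λ·min(1, π(a_t|x_t)/μ(a_t|x_t)). All names are illustrative and this is not the cited paper's function-approximation algorithm.

```python
import numpy as np

# Sketch of a tabular Retrace(lambda)-style correction for Q(x_0, a_0) over one
# trajectory. `q[s, a]` holds current estimates; `pi[s]` and `mu[s]` are the
# target/behavior action-probability vectors. The last logged transition is
# treated as terminal. Names are illustrative assumptions.
def retrace_correction(q, pi, mu, states, actions, rewards, gamma=0.99, lam=1.0):
    discount, trace, total = 1.0, 1.0, 0.0
    for t, (s, a, r) in enumerate(zip(states, actions, rewards)):
        s_next = states[t + 1] if t + 1 < len(states) else None
        v_next = float(np.dot(pi[s_next], q[s_next])) if s_next is not None else 0.0
        delta = r + gamma * v_next - q[s, a]      # expected TD error under pi
        total += discount * trace * delta
        if s_next is None:
            break
        discount *= gamma
        # Truncated importance ratio c = lam * min(1, pi/mu) for the next action.
        a_next = actions[t + 1]
        trace *= lam * min(1.0, pi[s_next][a_next] / mu[s_next][a_next])
    return total
```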

Publication date: 2016